feat(sedona-raster-zarr): cloud storage backends (s3, gcs, azure, http) via object_store#888
Open
james-willis wants to merge 3 commits into
Open
feat(sedona-raster-zarr): cloud storage backends (s3, gcs, azure, http) via object_store#888james-willis wants to merge 3 commits into
james-willis wants to merge 3 commits into
Conversation
5af7c08 to
946e6a4
Compare
2 tasks
3fbe051 to
6e79a90
Compare
…ect_store
ZarrChunkReader now accepts file://, bare paths, s3://, gs:// / gcs://,
az:// / abfs:// / abfss://, and http:// / https://. The reader's
public API is pure-storage-in: hand it an
Arc<dyn AsyncReadableListableStorageTraits> and a group URI (for
chunk-anchor URI formatting) and it walks the chunk grid. Where the
storage came from — registry lookup, env-var-built builder, an
in-memory test fixture — is the caller's concern, not the reader's.
* `open_storage_from_uri(uri, store_override: Option<Arc<dyn ObjectStore>>)`
is the helper for callers that don't already hold a credentialed
store. `Some(store)` uses it directly (with PrefixStore for
s3/gs/az; http(s) is rooted at the URL by its builder). `None`
falls back to per-scheme *Builder::from_env().with_url(uri).build()
— the same env-var credential discovery that read_format's
orchestrated path gets via ensure_object_store_registered_with_options.
file:// and bare paths always go through zarrs_filesystem's
FilesystemStore wrapped in SyncToAsyncStorageAdapter, so the local
backend shows up on the same async storage surface.
* ZarrChunkReader::try_new is async; the sync RecordBatchReader
streaming surface is unchanged since next() is pure CPU.
* No dedicated tokio runtime; async tasks run on whatever runtime
the caller is on (DataFusion's executor for the future SQL UDTF,
an ad-hoc current-thread runtime block_on'd in PyZarrChunkReader::new
for the Python FFI, #[tokio::test] in tests).
* PyZarrChunkReader::new builds storage via open_storage_from_uri
(env-var credentials by default), then drives the async try_new
through a local current-thread tokio runtime. No synthetic
RuntimeEnv, no datafusion-execution dep on the FFI cdylib.
* Replaces Group::child_arrays (which opens every child up front and
errors hard on any per-array metadata failure) with two purpose-
built paths driven by the caller's arrays filter:
- arrays: Some([...]) opens each by name with Array::async_open.
No listing — usable against backends that can't list (plain HTTPS
without WebDAV, S3-via-HttpStore).
- arrays: None lists direct children with storage.list_dir, then
Array::async_open each. Per-array open failures are logged at
warn! and skipped, so a single malformed sibling (e.g. an
xarray-style fixed-length-Unicode coord variable with a null
fill_value that zarrs 0.23 can't open) no longer poisons the
rest of the group.
* zarrs_object_store pinned to 0.5 in the workspace; 0.6 depends on
object_store 0.13, semver-incompatible with DataFusion 52's
object_store 0.12.x.
* Cloud smoke tests pass strictly against the public anonymous
ITS_LIVE v2 ice-velocity datacubes (s3://its-live-data/...). Same
bucket via s3:// (AmazonS3Builder) and via the virtual-hosted
HTTPS URL (HttpStore), with an explicit M11/M12 filter to avoid
listing on the HTTPS path.
6e79a90 to
531283a
Compare
paleolimbot
reviewed
Jun 2, 2026
Comment on lines
+110
to
+120
| fn default_object_store_for_uri(uri: &str) -> Result<Arc<dyn ObjectStore>, PyErr> { | ||
| // file:// and bare paths use a LocalFileSystem rooted at `/`; | ||
| // open_storage_from_uri prefixes it at the group's path. This | ||
| // matches the store the host's ObjectStoreRegistry yields for | ||
| // file://, so the loader treats local and cloud uniformly. | ||
| if uri.starts_with("file://") || !uri.contains("://") { | ||
| return Ok(Arc::new(object_store::local::LocalFileSystem::new())); | ||
| } | ||
| let url = Url::parse(uri) | ||
| .map_err(|e| PyValueError::new_err(format!("group URI {uri:?} is not a valid URL: {e}")))?; | ||
| match url.scheme().to_ascii_lowercase().as_str() { |
Member
There was a problem hiding this comment.
There are some tests in the sedona crate (I think) that smoke test these openers that you can steal. Alternatively, include fewer options since we'll remove some (not sure HTTP applies, unless there's some directory listing stuff I don't know about, and gcs/azure could probably be dropped for now).
Contributor
Author
There was a problem hiding this comment.
This is the temporary code until we pipe the store through. I'm not sure how much testing we want to add just for that
Member
There was a problem hiding this comment.
I think you can just remove the untested code here (filesystem and S3 should suffice to actually use this until we pipe through the store)
PyZarrChunkReader::new built a fresh current-thread tokio runtime on every open to bridge the async loader constructor into the sync Python FFI. Replace it with a process-wide OnceLock<Runtime> so the package builds the runtime once and reuses it. next() on the reader is pure CPU and never touches it.
Add the NASA JPL ITS_LIVE project/data URL (https://its-live.jpl.nasa.gov/) next to the cloud smoke-test bucket constants so the dataset the #[ignore] tests hit is self-documenting. Drop the `datafusion` umbrella and `async-trait` dependencies: neither is referenced anywhere in the crate. The only datafusion use is the sedona_internal_datafusion_err! macro, which expands to datafusion_common::DataFusionError and is satisfied by the retained datafusion-common dependency. Shrinks the compile and the wheel.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds cloud storage backends to
sedona-raster-zarr.ZarrChunkReadernow acceptsfile://, bare paths,s3://,gs:///gcs://,az:///abfs:///abfss://, andhttp:///https://.All Async
This package has been migrated to use all async for record loading and byte retrieval. This allows a single approach (cloud was always going to be async, but file system was originally sync) without shimming sync fs code into the async RS_EnsureLoaded udf and async cloud access code into the sync record loading machinery.
Version pinning
zarrs_object_storeis pinned to0.5in the workspace.0.6(the latest) depends onobject_store 0.13, which is semver-incompatible with theobject_store 0.12.xthe rest of the workspace uses through DataFusion 52.Merge coordination (with #886)
#886 adds the OutDb byte-loading infrastructure (
AsyncByteLoader,OutDbLoadRequest, the ZarrZarrLoader). Whichever of #886/#888 lands second must reconcile:sedona-raster-zarrZarrLoaderonto this PR's all-async storage path (it lands sync in Lazy Raster Loading Support for Zarr and GDAL #886).Arc<dyn ObjectStore>carried onOutDbLoadRequestinto the Zarr loader's reads, usingopen_storage_from_uri(uri, store).